Exploiting the Leipzig Corpora Collection
Abstract
In this paper, the Leipzig Corpora Collection is introduced as a contribution to the idea that there is a need for standardization of multilingual language resources. We explain the steps of building, processing, and presenting corpora of comparable sizes in a uniform format. Results from intra- and interlingual comparisons of corpora are given and methods that can build upon these corpora ...
Similar resources
Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages
The Leipzig Corpora Collection offers free online access to 136 monolingual dictionaries enriched with statistical information. In this paper we describe current advances of the project in collecting and processing text data automatically for a large number of languages. Our main interest lies in "low-density" languages, for which only little text data exists online. The aim of this approach is to ...
Standardized Multilingual Language Resources for the Web of Data
Statistical knowledge about natural languages is indispensable for various kinds of services requiring Natural Language Processing (NLP) functionality, such as information retrieval. The NLP Group at the University of Leipzig started providing such statistical information for more than 50 languages in the Leipzig Corpora Collection (LCC) [1] more than a decade ago. Some of their corpora contain more ...
Extracting Parallel Corpora from Comparable Documents to Improve Translation Quality in Machine Translation Systems
Training data for statistical machine translation methods is usually prepared from three kinds of resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...
Using Significant Word Co-occurences for the Lexical Access Problem
One way to analyse word relations is to examine their co-occurrence in the same context. This allows for the identification of potential semantic or lexical relationships between words. As previous studies showed, word co-occurrences often reflect human stimulus-response pairs. In this paper, significant sentence co-occurrences at the word level were used to identify potential responses for word stimu...
ASV Toolbox: a Modular Collection of Language Exploration Tools
ASV Toolbox is a modular collection of tools for the exploration of written language data for both scientific and educational purposes. It includes modules that operate on word lists or texts and allow users to perform various linguistic annotation, classification, and clustering tasks, including language detection, POS tagging, base form reduction, named entity recognition, and terminology extraction...